Repeated out-of-Africa expansions of Helicobacter pylori driven by replacement of deleterious mutations

Helicobacter pylori lives in the human stomach and has a population structure resembling that of its host. However, H. pylori from Europe and the Middle East trace substantially more ancestry from modern African populations than the humans that carry them. Here, we use a collection of Afro-Eurasian H. pylori genomes to show that this African ancestry is due to at least three distinct admixture events. H. pylori from East Asia, which have undergone little admixture, have accumulated many more non-synonymous mutations than African strains. European and Middle Eastern bacteria have elevated African ancestry at the sites of these mutations, implying selection to remove them during admixture. Simulations show that population fitness can be restored after bottlenecks by migration and subsequent admixture of small numbers of bacteria from non-bottlenecked populations. We conclude that recent spread of African DNA has been driven by deleterious mutations accumulated during the original out-of-Africa bottleneck.

This study looks at the evolutionary history of a dataset of 716 Helicobacter pylori sequences sampled in different parts of the world.
A dataset of 716 Helicobacter pylori whole-genome sequences was assembled, consisting of 213 newly sequenced isolates from Europe, Asia and Africa [...] comprehensive mapping of ancestries within the area, the genomes from Central and North East Africa to have solid whole-genome representation of the "Ancestral Europe" population. Lastly, the South East Asian genomes were chosen for their unadmixed hpAsia2 background, to serve as donors for hpAsia2 ancestry.
The cohorts, sampling procedure and bacterial isolation are detailed in the Supplementary Methods, together with the procedures for DNA extraction, library preparation, sequencing and primary bioinformatics. Genomes were sequenced at five different centres: Karolinska Institute (Sweden), Hannover Medical School (Germany), Hellenic Pasteur Institute (Greece), Oita University (Japan) and the University of Bath (UK). The primary bioinformatics analyses (trimming, filtering, quality checks and assembly) were also performed separately.
The clinical samples were collected in several different cohorts over the last 30 years and were selected to represent geographical areas or human populations rather than a specific time interval. They are not necessarily a representative sample, either geographically or pathologically: isolating H. pylori requires endoscopy, an invasive medical intervention, so samples are collected opportunistically, normally from middle-aged people with some kind of gastric complaint. Collection years, where available, are given in Supplementary Data 2.
Supplementary Data 5 details which genomes were used for which analyses and, where genomes were excluded from some analyses, the reason for their exclusion. A statement pointing to this information is also provided in the Methods section.
Multiple isolates from the same geographical zone/population were sampled in order to assess the variability of the different measurements.
The group allocations, apart from geographical origin, were the H. pylori subpopulations, which were inferred from the analysis results (fineSTRUCTURE analysis, Supplementary Fig. 1). Because the population assignments were central to the downstream analyses and to the interpretation of the results, blinding of the group allocation was not suitable for this study.